How do social scientists
answer questions using data?

PSCI 2270 - Week 4

Georgiy Syunyaev

Department of Political Science, Vanderbilt University

September 19, 2023

Plan for this week



  1. Question: What data can we collect to study factors that affect election participation?

  2. Applying CLT/LLN for estimation

  3. Comparing group means and logic of causal inference

Some Building Blocks


  • Probability:

    • Basis for understanding uncertainty in our estimates
    • Statistics is applied probability
  • Law of Large Numbers

    • Perform the same task over and over (draw an observation, draw a sample, etc.)
    • Average of the results converges to the truth
  • Central Limit Theorem:

    • Add up a lot of independent factors
    • Result follows the normal distribution

Large random samples


  • In real data, we will have a set of \(n\) measurements on a variable: \(X_1\) , \(X_2\), … , \(X_n\)

    • \(X_1\) is the age of the first randomly selected registered voter.
    • \(X_2\) is the age of the second randomly selected registered voter, etc.
  • Empirical analyses: sums or means of these \(n\) measurements

    • All statistical procedures involve a statistic, very often sum or mean.
    • What are the properties of these sums and means?
    • Can the sample mean of age tell us anything about the population distribution of age?
  • Asymptotics: what can we learn as \(n\) gets big?

Stats Lingo: LLN


Law of Large Numbers (LLN)

Let \(X_1\) , … , \(X_n\) be independent and identically distributed random variables with mean \(\mu\) and finite variance \(\sigma^2\). Then, \(\bar{X}_{n}\) converges to \(\mu\) as \(n\) gets large.


  • Intuition: The probability of \(\bar{X}_n\) being “far away” from \(\mu\) goes to \(0\) as \(n\) gets big

  • The distribution of sample mean “collapses” to population mean
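A quick simulation (not part of the slides) makes the "collapse" concrete. Here each draw is a fair coin flip, so the hypothetical truth is \(p = 0.5\); the running sample mean settles near \(p\) as \(n\) grows:

```python
import random

random.seed(42)

# Flip a fair coin many times (p = 0.5 is the assumed "truth") and
# watch the running sample mean settle down near p as n grows.
p = 0.5
draws = [1 if random.random() < p else 0 for _ in range(100_000)]

for n in (10, 100, 10_000, 100_000):
    print(n, sum(draws[:n]) / n)
```

With small \(n\) the sample mean can be far from 0.5; by \(n = 100{,}000\) it is within a fraction of a percentage point.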

Normal Distribution

  • The normal distribution is the classic “bell-shaped” curve.

    • Extremely ubiquitous in statistics
    • Fully described by its mean \(\mu\) and variance \(\sigma^2\)
    • When \(X\) is distributed normally, we write \(X \sim N ( \mu, \sigma^2 )\)
  • Three key properties:

    • Unimodal: one peak at the mean
    • Symmetric around the mean
    • Everywhere positive: any real value can possibly occur

Stats Lingo: CLT


Central Limit Theorem (CLT)

Let \(X_1\) , … , \(X_n\) be independent and identically distributed random variables with mean \(\mu\) and variance \(\sigma^2\). Then, \(\bar{X}_n\) will be approximately distributed \(N ( \mu, \sigma^2 / n )\) in large samples.


  • Approximation is better as \(n\) goes up \(\Rightarrow\) asymptotics

  • “Sample means tend to be normally distributed as samples get large.”

    • We now know how far away \(\bar{X}_n\) will be from its mean!
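A small simulation (illustrative, with an assumed \(p = 0.2\) and \(n = 500\)) shows the CLT at work. Each \(Y_i\) is Bernoulli, which looks nothing like a bell curve, yet the sample means across repeated samples center on \(p\) with spread near \(\sqrt{p(1-p)/n}\):

```python
import random
import statistics

random.seed(0)

# Each Y_i is Bernoulli(0.2) -- nothing like a bell curve -- yet the
# sample means across repeated samples behave like draws from a normal
# distribution centered on p with sd near sqrt(p * (1 - p) / n).
p, n, reps = 0.2, 500, 2000
means = []
for _ in range(reps):
    sample = [1 if random.random() < p else 0 for _ in range(n)]
    means.append(sum(sample) / n)

print(statistics.mean(means))   # close to p = 0.2
print(statistics.stdev(means))  # close to sqrt(0.2 * 0.8 / 500), about 0.0179
```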

Implications of CLT/LLN


  • By CLT, sample mean \(\approx\) normal with mean \(\mu\) and variance \(\sigma^2 / n\) (sd of \(\sigma / \sqrt{n}\))

  • By empirical rule, sample mean will be within \(2 \times \sigma / \sqrt{n}\) of the population mean 95% of the time

  • We usually have only 1 sample, so we’ll only get 1 sample mean. So why do we care about LLN/CLT?

    • CLT gives us assurances our sample mean won’t be too far from population mean
    • CLT will also help us create measure of uncertainty for our estimates, standard error (SE):

    \[ SE = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} \]
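The standard error formula is easy to sketch in code; the numbers below are illustrative, not from the slides:

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# The SE shrinks with the square root of n:
# quadrupling the sample size only halves the SE.
print(standard_error(1.0, 100))  # 0.1
print(standard_error(1.0, 400))  # 0.05
```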

Putting the Concepts to Work

  • Question: What proportion of the public approves of Biden’s job as president?
  • Latest Gallup poll:

    • Aug. 1-23
    • 1014 adult Americans
    • Telephone interviews
    • Approve (42%), Disapprove (53%)
    • The devil is in the details
  • What can we learn about Biden’s approval in the population from this one sample?

Samples from Population



  • Our focus: simple random sample of size \(n\) from some population \(Y_1\) , … , \(Y_n\)

    • Each individual is independently drawn \(\Rightarrow\) \(i.i.d.\) random variables
    • \(Y_i = 1\) if i approves of Biden, \(Y_i = 0\) otherwise
  • Statistical inference is using data to guess something about the population distribution of \(Y_i\)

Point Estimation


  • Point estimation: providing a single “best guess” as to the value of some fixed, unknown quantity of interest, \(\theta\) (read theta)

    • \(\theta\) is a feature of the population distribution
    • Also called: parameters
  • Examples of quantities of interest ( estimands ):

    • \(\mu = \mathbb{E} [ Y_i ]\): the population mean (turnout rate in the population)
    • \(\sigma^2 = \mathrm{Var}[ Y_i ]\): the population variance
    • \(\mu_1 - \mu_0 = \mathbb{E} [ Y_i (X_i = 1) ] - \mathbb{E} [ Y_i (X_i = 0) ]\): the population Average Treatment Effect (ATE)
  • These are the things we want to learn about

Estimators


Estimator

An estimator, \(\hat{\theta}\), of some parameter \(\theta\), is some function of the sample: \(\hat{\theta} = h(Y_1 , ... , Y_n )\).

  • An estimate is one particular realization of the estimator

    • Ideally we’d like to know the estimation error \(\hat\theta - \theta\)
    • Problem: \(\theta\) is unknown
    • Solution: figure out the properties of \(\hat{\theta}\) using probability
  • \(\hat{\theta}\) is a random variable because it is a function of sequence of random draws \(\Rightarrow\) \(\hat{\theta}\) has a distribution

Estimating Biden’s support


  • Parameter \(\theta\): population proportion of adults who support Biden
  • There are many (\(\infty\) ?) different possible estimators:

    • \(\hat{\theta} = \bar{Y}_n\) : the sample proportion of respondents who support Biden
    • \(\hat{\theta} = Y_1\) : just use the first observation
    • \(\hat{\theta} = \max( Y_1 , ... , Y_n )\) : pick the maximum of all observations
    • \(\hat{\theta} = 0.5\) : always guess 50% support
  • How good are these different estimators?
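One way to compare the four candidate estimators is by simulation. This sketch assumes a hypothetical truth \(p = 0.42\) (matching the Gallup example) and averages each estimator across many repeated polls:

```python
import random

random.seed(1)

# Simulate many repeated polls of size n and average each candidate
# estimator across simulations. The truth p = 0.42 is a hypothetical
# value chosen to match the Gallup example; it is unknown in practice.
p, n, reps = 0.42, 1014, 2000

avg = {"sample mean": 0.0, "first obs": 0.0, "max": 0.0, "always 0.5": 0.0}
for _ in range(reps):
    ys = [1 if random.random() < p else 0 for _ in range(n)]
    avg["sample mean"] += sum(ys) / n
    avg["first obs"] += ys[0]
    avg["max"] += max(ys)
    avg["always 0.5"] += 0.5

for name in avg:
    avg[name] /= reps
    print(name, round(avg[name], 3))
```

The sample mean and the first observation are both right on average (unbiased), but the first observation has far higher variance; the max is badly biased toward 1; the constant 0.5 ignores the data entirely.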

Survey


  • Assume a simple random sample of n voters: \(n = 1014\)

  • Define random variable \(Y_i\) for Biden’s approval:

    • \(Y_i = 1 \rightarrow\) respondent \(i\) approves of Biden
    • \(Y_i = 0 \rightarrow\) respondent \(i\) disapproves of Biden
  • \(Y_i\) has probability of success \(p\)

    • “probability of success” = “probability of randomly selecting a Biden approver from population”
    • Remember that \(p\) is the expectation of \(Y_i\): \(p = P (Y_i = 1) = \mathbb{E} [ Y_i ]\)

Survey



  • Sample proportion is the same as the sample mean:

\[ \bar{Y} = \frac{1}{n} \sum_{i = 1}^{n} Y_i = \frac{\text{number who support Biden}}{n} \]

  • \(\theta = p\)

  • \(\hat\theta = \bar{Y}\)
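A toy example (hypothetical responses, not poll data) shows that the sample proportion is just the mean of the 0/1 variable:

```python
# Toy example: five hypothetical responses, 1 = approves of Biden.
ys = [1, 0, 1, 1, 0]

# Sample proportion = sample mean of the 0/1 variable.
ybar = sum(ys) / len(ys)
print(ybar)  # 0.6
```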

Sample Mean Properties



\[ \underbrace{\text{sample mean}}_{\bar{Y}} = \underbrace{\text{population mean}}_{p} + \text{chance error} \]

  • Remember: the sample mean is a random variable

    • Different samples give different sample means
    • Chance error “bumps” sample mean away from population mean
    • \(\Rightarrow \bar{Y}\) has a distribution across repeated samples: the sampling distribution

Central Tendency of the Sample Mean



  • Expectation: average of the estimates across repeated samples

    • By linearity of expectation: \(\mathbb{E}[\bar{Y}] = \mathbb{E}[ Y_i ] = p\)
    • \(\rightarrow\) chance error is \(0\) on average: \[\mathbb{E}[\bar{Y} − p] = \mathbb{E}[\bar{Y}] − p = 0\]
  • UnBIASedness: Sample proportion is on average equal to the population proportion

Spread of the Sample Mean


  • Standard error: how big is the chance error on average?
  • We can use a special rule for binary random variables to calculate the SD:

\[\sqrt{\mathrm{Var}(\bar{Y})} = \sqrt{\frac{p(1 − p)}{n}}\]

  • Problem: we don’t know \(p\)!
  • Solution: estimate the SE

\[\sqrt{\widehat{\mathrm{Var}}(\bar{Y})} = \sqrt{\frac{\bar{Y}(1 − \bar{Y})}{n}} = \sqrt{\frac{0.42 (1 − 0.42)}{1014}} \approx 0.016\]
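Plugging the Gallup numbers into the estimated-SE formula:

```python
import math

# Estimated SE of the sample proportion, using the Gallup numbers:
# Y-bar = 0.42, n = 1014.
ybar, n = 0.42, 1014
se_hat = math.sqrt(ybar * (1 - ybar) / n)
print(round(se_hat, 4))  # 0.0155, i.e. roughly 0.016
```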

Confidence Intervals



  • Awesome: Sample proportion is correct on average
  • Awesomer: Get a range of plausible values
  • Confidence interval: way to construct an interval that will contain the true value in some fixed proportion of repeated samples

Using CLT

\[ \bar{Y} − p = \text{chance error}\]

  • How can we figure out a range of plausible chance errors?

    • Find a range of plausible chance errors and add them to \(\bar{Y}\)
  • Central Limit Theorem:

\[\bar{Y} \sim N \left( \underbrace{\mathbb{E}[Y_i]}_{p}, \underbrace{\frac{\mathrm{Var}(Y_i)}{n}}_{\frac{p(1-p)}{n}} \right)\]

  • Chance error: \(\bar{Y} − p\) is approximately normal with mean 0 and SD equal to \(\sqrt{\frac{p(1-p)}{n}}\)

Confidence interval



  • First, choose a confidence level.

    • What percent of chance errors do you want to count as “plausible”?
    • Convention is 95%.
  • \(100 \times (1 − \alpha)\) % confidence interval: \(CI = \bar{Y} ± z_{\alpha/2} \times SE\)

    • In polling, \(\pm z_{\alpha/2} × SE\) is called the margin of error

CIs for the Gallup Poll


  • Gallup poll: \(\bar{Y} = 0.42\) with an SE of \(0.016\)
  • 90% CI: \[[0.42 − 1.64 × 0.016, 0.42 + 1.64 × 0.016] = [0.394, 0.446]\]
  • 95% CI: \[[0.42 − 1.96 × 0.016, 0.42 + 1.96 × 0.016] = [0.389, 0.451]\]
  • 99% CI: \[[0.42 − 2.58 × 0.016, 0.42 + 2.58 × 0.016] = [0.379, 0.461]\]
  • More confidence \(\rightarrow\) wider intervals
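The interval arithmetic can be checked in a few lines. This sketch uses the exact (unrounded) SE, so endpoints may differ from the rounded-SE numbers above in the third decimal place:

```python
import math

# Confidence intervals for the Gallup poll, using the exact SE rather
# than the rounded 0.016.
ybar, n = 0.42, 1014
se = math.sqrt(ybar * (1 - ybar) / n)

# z-values are the standard normal quantiles for each confidence level.
for level, z in [(90, 1.64), (95, 1.96), (99, 2.58)]:
    lo, hi = ybar - z * se, ybar + z * se
    print(f"{level}% CI: [{lo:.3f}, {hi:.3f}]")
```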

95% CI’s
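A coverage simulation (assuming a hypothetical truth \(p = 0.42\)) shows what the 95% guarantee means: across repeated samples, about 95% of the intervals we construct contain the true value.

```python
import math
import random

random.seed(7)

# Repeatedly draw samples, build a 95% CI from each, and count how
# often the interval covers the assumed truth p = 0.42.
p, n, reps = 0.42, 1014, 2000
covered = 0
for _ in range(reps):
    ys = [1 if random.random() < p else 0 for _ in range(n)]
    ybar = sum(ys) / n
    se = math.sqrt(ybar * (1 - ybar) / n)
    if ybar - 1.96 * se <= p <= ybar + 1.96 * se:
        covered += 1

print(covered / reps)  # close to 0.95
```

Any single interval either contains \(p\) or it doesn’t; the 95% refers to the procedure across repeated samples.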


Next Time



  • Comparison between samples

  • Potential outcomes framework

  • Logic of experimentation

  • Next week: Short discussion of research questions in class
